Members: Hongning Yu, Hui Jiang, Hao Pan
The dataset we use is a lyrics dataset (lyrics from MetroLyrics), which can be downloaded from Kaggle for free: https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics. By exploring this dataset, we can identify the key features of each song genre and predict the genre of new songs.
The dataset contains 362,237 records and 5 features (song name, year, artist, genre, and lyrics). It consists entirely of text divided into documents, and since we can predict song genres from lyrics, it meets the requirements for Lab 2.
For this project, our main purpose is to find the features of different song genres by analyzing the most frequent words in lyrics. Visualizing these features will reveal more information about them, and we may then be able to figure out the relationships among features, which could benefit our genre prediction as well.
The statistical and prediction results can be applied to applications related to song search or recommendation. For example, a song search application (such as the one Siri may use when you ask her "What song is this?") could narrow its search scope by classifying songs according to lyric features. A song recommendation application could suggest songs by analyzing the lyrics of a user's favorite songs.
To ensure prediction quality, we will set a target for an evaluation metric such as accuracy or AUC (for example, 80%) and measure it with standard evaluation functions. We will switch to other, more informative evaluation metrics if needed.
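As a minimal sketch of how such a target could be checked, here is scikit-learn's `accuracy_score` applied to a handful of hypothetical genre labels (the labels below are made up for illustration, not taken from the dataset):

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted genre labels, just to illustrate the metric
y_true = ['Rock', 'Pop', 'Hip-Hop', 'Rock', 'Jazz']
y_pred = ['Rock', 'Pop', 'Rock', 'Rock', 'Jazz']

acc = accuracy_score(y_true, y_pred)  # fraction of labels predicted correctly
print(acc)  # 0.8
```

AUC would need predicted probabilities rather than hard labels (e.g. `roc_auc_score` with a one-vs-rest scheme for multiclass), so accuracy is the simpler starting point.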
First let's load the data into a dataframe. The data is already in a CSV file, but all of the lyrics are raw text in different formats. Our goal is to predict genre based on lyrics, so we still need to clean all the lyrics.
import pandas as pd
import nltk
import numpy as np
import string
pd.set_option('display.max_columns', 60)
df = pd.read_csv("./lyrics.csv", encoding="utf-8")
df.head()
df.isnull().sum()
Looks like there are null values in lyrics and song. Just drop them.
df.dropna(inplace=True)
df.isnull().sum()
df.genre.value_counts()
As we can see, some genres have far more records than others. For our genre-prediction classification problem, we could sample the dataset and choose subsets of some genres to avoid bias. But let's keep it as it is for now and deal with this later.
Check the dataframe info:
df.info()
First let's try to get rid of all non-ASCII characters, since we only want English characters.
This approach takes too much time, so it is commented out:
# %%time
# import re
# for row in df.index[:1000]:
#     df.loc[row, 'lyrics'] = df.loc[row, 'lyrics'].encode('ascii', errors='ignore').decode()
# for row in df.index[:1000]:
#     df.loc[row, 'lyrics'] = re.sub(r'[^\x00-\x7f]', r'', df.loc[row, 'lyrics'])
We want to focus on songs with English lyrics, so let's delete all non-English records if they exist.
I tried to build an English-ratio detector to eliminate all non-English songs. Reference: https://github.com/rasbt/musicmood/blob/master/code/collect_data/data_collection.ipynb
But the per-row set computation in the loop takes too much time and needs improvement.
# %%time
# def eng_ratio(text):
#     '''Returns the ratio of non-English to English words in a text'''
#     english_vocab = set(w.lower() for w in nltk.corpus.words.words())
#     text_vocab = set(w.lower() for w in text.split('-') if w.lower().isalpha())
#     unusual = text_vocab.difference(english_vocab)
#     diff = len(unusual) / (len(text_vocab) + 1)
#     return diff
#
# # first, eliminate non-English songs by their names
# before = df.shape[0]
# for row_id in range(100):
#     text = df.loc[row_id]['song']
#     diff = eng_ratio(text)
#     if diff >= 0.5:
#         df = df[df.index != row_id]
# after = df.shape[0]
# rem = before - after
# print('%s have been removed.' % rem)
# print('%s songs remain in the dataset.' % after)
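One likely reason the loop above is slow is that `eng_ratio` rebuilds the `nltk.corpus.words.words()` set on every call. A sketch of the fix is to hoist that set out of the function and pass it in; here a tiny stand-in vocabulary replaces the real nltk word list so the example is self-contained:

```python
# In the real pipeline this would be built ONCE from nltk:
#   english_vocab = set(w.lower() for w in nltk.corpus.words.words())
english_vocab = {'the', 'quick', 'brown', 'fox'}  # tiny stand-in vocabulary

def eng_ratio(text, vocab):
    """Ratio of out-of-vocabulary tokens to all tokens in a hyphen-separated title."""
    tokens = set(w.lower() for w in text.split('-') if w.lower().isalpha())
    unusual = tokens - vocab
    return len(unusual) / (len(tokens) + 1)

print(eng_ratio('the-quick-brown-fox', english_vocab))  # 0.0 (all words known)
print(eng_ratio('la-vie-en-rose', english_vocab))       # 0.8 (4 unknown / 5)
```

Building the vocabulary set once turns an O(rows × vocabulary) pattern into a single vocabulary build plus cheap set lookups per row.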
This is another approach, which uses the package from https://github.com/saffsd/langid.py. This package detects language considerably faster, but processing 260k records still takes around 50 minutes.
# # package from https://github.com/saffsd/langid.py
# import langid
# before = df.shape[0]
# for row in df.index:
#     lang = langid.classify(df.loc[row]['lyrics'])[0]
#     if lang != 'en':
#         df = df[df.index != row]
# after = df.shape[0]
# rem = before - after
# print('%s have been removed.' % rem)
# print('%s songs remain in the dataset.' % after)
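Part of the slowness is also the filtering pattern itself: rebuilding `df` with `df = df[df.index != row]` inside the loop copies the frame once per removed row. A faster idiom is one `apply` pass plus a boolean mask. The sketch below uses a trivial ASCII heuristic as a stand-in detector so it runs without `langid` installed; in the real pipeline the function body would be `return langid.classify(text)[0]`:

```python
import pandas as pd

def detect_lang(text):
    # Real pipeline: return langid.classify(text)[0]
    # Stand-in heuristic here so the sketch runs without langid installed
    return 'en' if text.isascii() else 'other'

df = pd.DataFrame({'lyrics': ['hello world', 'こんにちは世界']})

# One pass over the column, then a single boolean-mask selection
mask = df['lyrics'].apply(detect_lang) == 'en'
df_en = df[mask]
print(len(df_en))  # 1
```

This keeps the cost to one classification per row plus a single copy at the end, instead of one copy per dropped row.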
# df.to_csv('lyrics_new.csv',index_label='index')
df = pd.read_csv("./lyrics_new.csv", encoding="utf-8").drop('index.1', axis=1)
df.genre.value_counts()
300k records easily runs out of memory, so I resampled the dataset and chose an equal number of records for each genre.
grouped = df.groupby('genre')
df_sample = grouped.apply(lambda x: x.sample(n=1800, random_state=7))
print("Size of dataframe: {}".format(df_sample.shape[0]))
df_sample.genre.value_counts()
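One caveat with `sample(n=1800)`: it raises a `ValueError` if any genre has fewer than 1800 rows. A hedged sketch of a more defensive version caps the sample size per group (shown here on a tiny made-up frame with `n=3`):

```python
import pandas as pd

df = pd.DataFrame({'genre': ['Rock'] * 5 + ['Jazz'] * 2,
                   'lyrics': list('abcdefg')})
n = 3
# Cap the per-group sample size so small genres don't raise ValueError
df_sample = (df.groupby('genre', group_keys=False)
               .apply(lambda g: g.sample(n=min(len(g), n), random_state=7)))
print(df_sample['genre'].value_counts().to_dict())  # {'Rock': 3, 'Jazz': 2}
```

With the real 1800-per-genre sample this only matters if the English filtering left some genre under 1800 records.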
# reset_index removes the old index (and moves it to a column unless drop=True)
df_sample.reset_index(drop=True, inplace=True)
df_sample.head(10)
# check lyrics with length <= 100
less_than_100 = 0
for row in df_sample.index[:1000]:
    if len(df_sample.loc[row]['lyrics']) <= 100:
        print(df_sample.loc[row]['lyrics'])
        less_than_100 += 1
print("\nNum of lyrics with length <= 100 in first 1000: {}".format(less_than_100))
It looks like many songs don't have meaningful lyrics (instrumental music, or something went wrong during crawling).
So we simply drop all records whose lyrics are shorter than about 100 characters.
print("Deleting records with lyric length < 100")
len_before = df_sample.shape[0]
df_clean = df_sample.copy()
for row in df_clean.index:
    if len(df_clean.loc[row]['lyrics']) <= 100:
        df_clean.drop(row, inplace=True)
len_after = df_clean.shape[0]
print("Before: {}\nAfter : {}\nDeleted: {}".format(len_before, len_after, len_before-len_after))
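The row-by-row `drop(inplace=True)` loop above works, but the same filter can be written as one vectorized boolean mask, which is both shorter and much faster on large frames. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'lyrics': ['x' * 50, 'y' * 150, 'z' * 300]})

# Keep only rows whose lyric string is longer than 100 characters, in one pass
df_clean = df[df['lyrics'].str.len() > 100].copy()
print(df_clean.shape[0])  # 2
```

The `.copy()` avoids SettingWithCopy warnings if `df_clean` is modified later.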
df_clean.genre.value_counts()
x = df_clean['lyrics'].values
y = df_clean['genre'].values
print('Size of x: {}\nSize of y: {}'.format(x.size, y.size))
x = x.tolist()
x[1]
# def count_sentence_len(lyric):
#     """count the average sentence length of a lyric"""
#     sents_list = lyric.split('\n')
#     avg_len = sum(len(s.split()) for s in sents_list) / len(sents_list)
#     return avg_len
# sentence_length_avg = []
x_clean = []
translator = str.maketrans('', '', string.punctuation)
for l in x:
    l = l.translate(translator)
    # sentence_len = count_sentence_len(l)
    # sentence_length_avg.append(sentence_len)
    l = l.replace('\n', ' ')
    x_clean.append(l)
# randomly print 5 lyrics
import random
for i in random.sample(range(len(x_clean)), 5):
    print(x_clean[i])
    print("=============================")
print(len(x_clean))
Both nltk and scikit-learn ship built-in lists of English stop words. Here I build my own stop-word list based on scikit-learn's built-in one.
%%time
x_clean = [x.lower() for x in x_clean]
x_clean_new = []
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stop_words = list(ENGLISH_STOP_WORDS)
stop_words = stop_words + ['will', 'got', 'ill', 'im', 'let']
for text in x_clean:
    text = ' '.join([word for word in text.split() if word not in stop_words])
    x_clean_new.append(text)
x_clean = x_clean_new
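One easy speed-up for the loop above: `word not in stop_words` against a *list* scans all ~320 entries per word, while membership in a *set* is O(1). A minimal sketch of the same removal with a set (the tiny stop-word set here is a stand-in for the real one):

```python
# Membership tests against a list are O(len(list)) per word; a set is O(1)
stop_words = {'the', 'a', 'is', 'will', 'got'}  # small stand-in stop-word set

def remove_stop_words(text, stops):
    return ' '.join(w for w in text.split() if w not in stops)

print(remove_stop_words('the night is young', stop_words))  # night young
```

In the real cell this just means `stop_words = set(ENGLISH_STOP_WORDS) | {'will', 'got', 'ill', 'im', 'let'}`.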
Here I used an English word list from https://github.com/eclarson/MachineLearningNotebooks/tree/master/data
with open('./ospd.txt', encoding='utf-8', errors='ignore') as f1:
    vocab1 = f1.read().split("\n")
print(len(vocab1))
from sklearn.feature_extraction.text import CountVectorizer
# CountVectorizer automatically lowercases words
cv = CountVectorizer(stop_words='english',
                     encoding='utf-8',
                     lowercase=True,
                     vocabulary=vocab1)
bag_words = cv.fit_transform(x_clean)
print('Shape of bag words: {}'.format(bag_words.shape))
print("Length of Vocabulary: {}".format(len(cv.vocabulary_)))
Let's create a pandas dataframe containing the bag-of-words (BoW) model.
df_bow = pd.DataFrame(data=bag_words.toarray(), columns=cv.get_feature_names_out())
df_bow.head()
%%time
word_freq = df_bow.sum().sort_values(ascending=False)
word_freq[:30]
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(stop_words='english',
                             encoding='utf-8',
                             lowercase=True,
                             vocabulary=vocab1)
tfidf_mat = tfidf_vect.fit_transform(x_clean)
print('Shape of tf-idf matrix: {}'.format(tfidf_mat.shape))
print("Length of Vocabulary: {}".format(len(tfidf_vect.vocabulary_)))
df_tfidf = pd.DataFrame(data=tfidf_mat.toarray(), columns=tfidf_vect.get_feature_names_out())
df_tfidf
%%time
word_score = df_tfidf.sum().sort_values(ascending=False)
word_score[:30]
We can also calculate the correlation (cosine similarity) matrix, where the number at position (i, j) represents the similarity between song i and song j.
corr = (tfidf_mat * tfidf_mat.T).A
corr.shape
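Why does `tfidf_mat * tfidf_mat.T` give cosine similarities? `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the dot product of two rows is exactly their cosine similarity. A small self-contained check against scikit-learn's `cosine_similarity`, on three made-up documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['love me tender', 'love me do', 'smoke on the water']
X = TfidfVectorizer().fit_transform(docs)

# Rows are L2-normalised by default, so X @ X.T equals cosine similarity
corr = (X @ X.T).toarray()
assert np.allclose(corr, cosine_similarity(X))
print(corr[0, 0])  # 1.0 — every document is maximally similar to itself
```

The diagonal is all ones and off-diagonal entries lie in [0, 1], higher meaning more shared (weighted) vocabulary.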
df_clean.head()
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
plt.style.use('ggplot')
freq = pd.DataFrame(word_freq, columns = ['frequency'])
fig = freq[:20].plot(kind = 'barh', figsize=(9,8), fontsize=18)
# plt.legend('number of occurrences', loc = 'upper right')
plt.gca().invert_yaxis()
plt.title('words frequencies', fontsize=20)
As we can see in this histogram, the most frequent words are "love", "know", "like", and so on. Among the top 20 frequent words listed, the frequency of the top 4 (love, know, like, just) is almost triple that of the last 3, i.e. there is a considerable difference between the frequencies of different words. One more thing we notice is that there is an interjection in the list, "oh", and it is the sixth most frequent word. We didn't even notice that artists used so many "oh"s in their lyrics!
score = pd.DataFrame(word_score, columns = ['Score'])
ax = score[:20].plot(kind = 'barh', figsize=(9,8), fontsize=18)
plt.legend('score', loc = 'lower right', fontsize=15)
plt.gca().invert_yaxis()
plt.title('tf-idf score')
To figure out the most frequent words for each genre, TF-IDF may be more appropriate, given that TF-IDF reflects how important a word is to a document. From the plot above, we can see that the top-scoring words are totally different from the words ranked by raw term frequency.
We can also see that some words, like "al", "bo", "dor", and "la", have high TF-IDF scores. This may be due to the phenomenon that these words appear in only a few documents (songs), which makes them so "special" that they are highlighted as important words for those documents.
TF-IDF analysis for each genre is needed.
# code example from https://www.kaggle.com/carrie1/drug-of-choice-by-genre-using-song-lyrics
df_clean['word_count'] = df_clean['lyrics'].str.split().str.len()
df_clean.info()
f, ax = plt.subplots(figsize=(10, 9))
sns.violinplot(x = df_clean.word_count)
plt.xlim(-100, 1000)
plt.title('Word count distribution', fontsize=26)
The violin plot shows the distribution of all songs by the number of words in their lyrics.
The figure shows that most songs have lyrics of 100 to 300 words. The median lyric length is around 200, and only a small fraction of songs have lyrics longer than 400 words.
This makes sense for real lyric lengths. After all, people may get tired of songs with too many words and are unlikely to fall in love with songs that have only a few.
The plot above covers all lyrics without classifying by genre, so we still cannot see the desired feature for each genre.
f, ax = plt.subplots(figsize=(10, 9))
sns.boxplot(x = "genre", y = "word_count", data = df_clean, palette = "Set1")
plt.ylim(1,2000)
To figure out the lyric-length feature for each genre, we group the data by genre and draw a box plot for each.
According to the plot, the medians of most boxes are under 250 (around 200). Only the median for Hip-Hop is around 500, more than double that of the others. For the maximum, Electronic, Rock, and Hip-Hop have the three longest lyrics, and there is no big difference in the minimum across genres.
In general, the five genres with the longest lyrics are Hip-Hop, Pop, R&B, Electronic, and Indie, while the three shortest are Jazz, Metal, and Country. It seems that up-tempo genres are more likely to have longer lyrics, and vice versa. But we still need to pay attention to exceptions: Metal songs are up-tempo, yet they mostly have shorter lyrics than the other up-tempo genres. Thus, lyric length can be a useful signal for genre classification but should not be the deciding metric.
mpl.rc("figure", figsize=(12,12))
sns.violinplot(x='genre', y='year', data=df_clean)
Looks like the distribution is biased with extreme values. So let's check outliers.
df_clean[df_clean['year'] <= 2000].shape[0]
Drop songs from 2000 and earlier and plot again.
for row in df_clean[df_clean['year'] <= 2000].index:
    df_clean.drop(row, inplace=True)
mpl.rc("figure", figsize=(15, 25))
sns.violinplot(x='year', y='genre', data=df_clean, inner="quartile")
We can see that the distributions are quite different. Country, Metal, Pop, R&B, and Rock have a more concentrated distribution, with most songs created during 2005~2010. Other genres have a quite stretched distribution. "Other" songs (songs not labeled with a genre) were mostly composed after 2012, probably because new songs don't have labels yet.
Several genres had a big bang around 2006~2009. We wonder whether this distribution reflects reality or is just a crawling artifact.
top_artist = df.artist.value_counts().head(8).index.tolist()
# df_clean['artist'].isin(top_artist)
# df_clean.loc[df_clean['artist'] in]
df_top_artist = df_clean.loc[df_clean['artist'].isin(top_artist), :]
df_top_artist.head()
df_top_artist.info()
mpl.rc("figure", figsize=(25, 15))
sns.violinplot(x='artist', y='year', data=df_top_artist, inner="quartile")
sns.set(font_scale=3)
For the top 8 artists, we plot this figure to explore their most productive years. For eddy-arnold, dolly-parton, eminem, barba-streisan, and bee-gees, most songs were composed during 2005~2010. Cris-crown and bob-dylan, on the other hand, seem to have kept creating for a long time. However, bob-dylan's works look sort of "ahead of time", which may be due to incorrectly entered year information.
print(df_bow.shape)
print(len(y))
df_bow['length'] = df_bow.sum(axis=1)
# create two new columns:
#   length: length of each document based on the bag-of-words model
#   genre: genre of the record
df_bow['genre'] = pd.Series(y).values
mpl.rc("figure", figsize=(25, 15))
sns.violinplot(x='length', y='genre', data=df_bow, inner="quartile")
sns.set(font_scale=3)
This is another way to calculate lyric length, based on the bag of words. The violin plot of lyric length per genre corresponds to the box plot above.
Next we want to check the top 10 frequent words of each genre.
genre_count = df_bow.groupby('genre').sum()
genre_count.drop('length', axis=1, inplace=True)
genre_count.head()
genre_count_new = genre_count.transpose()
genre_list = df_clean.genre.unique().tolist()
for genre in genre_list:
    t = genre_count_new.nlargest(10, genre, keep='first')[genre]
    fig = plt.figure(figsize=(6, 4))
    fig.suptitle(genre, fontsize=20)
    plt.xticks(rotation='vertical')
    sns.barplot(x=t.values, y=t.index, alpha=0.8)
sns.set(font_scale=3)
In the histograms above, we list the top frequent words for each genre. Different genres share many of their top-10 frequent words, and this information is also visualized in the word cloud figures in Part 4.
From those histograms, it is pretty clear that "love" is something almost every type of music cares about, along with other shared words such as "know", "time", and "oh". Many of these words are verbs. Hip-hop, however, has a quite different set of frequent words, distinctive from the other genres.
Now it is 'wordcloud' time. A word cloud is a visual representation of text data and a very efficient way to represent word frequencies.
First let's draw the overall word cloud based on term frequency.
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
plt.style.use('ggplot')
all_lyrics = ''
for lyric in x_clean:
    all_lyrics += (' ' + lyric)
# code example from https://amueller.github.io/word_cloud/index.html
wordcloud = WordCloud(max_font_size=60).generate(all_lyrics)
plt.figure(figsize=(15,15))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
We can clearly see that the most frequently used word overall is 'love', followed by 'got'.
word_freq[:30]
As we can see, the word cloud describes word frequency in a visual way.
Let's try plot word clouds in different genres.
d = {'genre': y.tolist(), "lyric": x_clean}
df_plot = pd.DataFrame(d)
df_plot.head(10)
Now let's separate those lyrics into different genres.
# create a dictionary storing all lyrics keyed by genre
lyrics = {}
for genre in df_plot.genre.unique().tolist():
    lyrics[genre] = ' '
    for row in df_plot[df_plot['genre'] == genre].index:
        lyrics[genre] = lyrics[genre] + ' ' + df_plot.loc[row, 'lyric']
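Repeated string `+=` in the nested loop above is quadratic in total text length, since each concatenation copies the accumulated string. A linear-time sketch builds the same genre-keyed dictionary with `groupby` and `' '.join` (shown on a tiny made-up frame):

```python
import pandas as pd

df_plot = pd.DataFrame({'genre': ['Rock', 'Jazz', 'Rock'],
                        'lyric': ['hello world', 'blue moon', 'night train']})

# One join per genre instead of one string copy per song
lyrics = df_plot.groupby('genre')['lyric'].apply(' '.join).to_dict()
print(lyrics['Rock'])  # hello world night train
```

The resulting dictionary plugs straight into the word-cloud loop below.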
for genre, lyric in lyrics.items():
    wordcloud = WordCloud(max_font_size=60).generate(lyric)
    fig = plt.figure(figsize=(10, 8))
    fig.suptitle(genre, fontsize=24)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.tight_layout()
In those word clouds, the word 'love' is almost always the most frequent one in each genre, and words like 'life', 'know', and 'time' are frequently used as well. But there are also some differences among genres. For example, in jazz the word 'heart' is used more than in other genres, and there are more dirty words in hip-hop, which makes sense. After exploring all the lyrics, we can conclude that most lyrics share some words in common, but each kind of music also has its own unique words. Based on these results, we can make genre predictions in the future.
Raschka, S. (2015). Python machine learning. Packt Publishing Ltd.
https://www.kaggle.com/carrie1/drug-of-choice-by-genre-using-song-lyrics
https://github.com/eclarson/MachineLearningNotebooks/tree/master/data